Due to their ability to offer more comprehensive information than data from a single view, multi-view (multi-source, multi-modal, multi-perspective, etc.) data are being used more frequently in remote sensing tasks. However, as the number of views grows, the issue of data quality becomes more apparent, limiting the potential benefits of multi-view data. Although recent deep neural network (DNN) based models can learn data weights adaptively, the lack of research on explicitly quantifying the quality of each view when fusing them leaves these models hard to interpret, and they perform unsatisfactorily and inflexibly in downstream remote sensing tasks. To fill this gap, in this paper, evidential deep learning is introduced into the task of aerial-ground dual-view remote sensing scene classification to model the credibility of each view. Specifically, the theory of evidence is used to calculate an uncertainty value that describes the decision-making risk of each view. Based on this uncertainty, a novel decision-level fusion strategy is proposed to ensure that the view with lower risk obtains more weight, making the classification more credible. The proposed approach achieves state-of-the-art results on two well-known, publicly available datasets of aerial-ground dual-view remote sensing images, demonstrating its effectiveness. The code and datasets of this article are available at the following address: https://github.com/gaopiaoliang/Evidential.
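As an illustration of the idea, the sketch below shows one common way to derive a per-view uncertainty from Dirichlet evidence and use it to weight a decision-level fusion of the two views; the softplus evidence mapping and the credibility weighting are assumptions for illustration, not necessarily the exact rule used in the paper.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits: torch.Tensor):
    """Map classifier logits to Dirichlet belief masses and an uncertainty mass.

    Standard evidential deep learning recipe: evidence = softplus(logits),
    alpha = evidence + 1, Dirichlet strength S = sum(alpha),
    belief_k = evidence_k / S, uncertainty u = K / S.
    """
    evidence = F.softplus(logits)                    # non-negative evidence, (B, K)
    alpha = evidence + 1.0                           # Dirichlet concentration
    strength = alpha.sum(dim=-1, keepdim=True)       # S
    belief = evidence / strength                     # per-class belief mass
    uncertainty = logits.size(-1) / strength         # u in (0, 1], lower = more credible
    return belief, uncertainty

def fuse_dual_view(logits_aerial: torch.Tensor, logits_ground: torch.Tensor):
    """Decision-level fusion that gives the lower-uncertainty view more weight.

    Illustrative credibility weighting, not necessarily the paper's exact rule.
    """
    b_a, u_a = evidential_uncertainty(logits_aerial)
    b_g, u_g = evidential_uncertainty(logits_ground)
    w_a, w_g = 1.0 - u_a, 1.0 - u_g                  # credibility of each view
    fused = (w_a * b_a + w_g * b_g) / (w_a + w_g)    # weighted belief per class
    return fused.argmax(dim=-1)

# toy usage: hypothetical per-view logits for a batch of 4 scenes and 10 classes
prediction = fuse_dual_view(torch.randn(4, 10), torch.randn(4, 10))
```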
In task-oriented dialogs such as MultiWoZ (Budzianowski et al., 2018), an informative and/or successful system response needs to include necessary key information, such as the phone number of a hotel. We therefore hypothesize that by helping the model focus more on learning key quantities in the dialog, it can generate more informative and helpful responses. In this paper, we propose a new training algorithm, Reinforced Language Modeling (RLM), which uses a fine-grained reward function and reinforcement learning to help the model focus on generating key quantities correctly at test time. Empirical results show that our proposed RLM achieves state-of-the-art performance on the inform rate, success rate, and combined score on MultiWoZ.
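To make the idea concrete, the following sketch shows one possible fine-grained reward that checks whether required key quantities appear in a generated response, combined with a REINFORCE-style update; the reward shape, the `key_quantity_reward` helper, and the baseline handling are illustrative assumptions rather than the exact RLM objective.

```python
import torch

def key_quantity_reward(response: str, key_values: list[str]) -> float:
    """Fraction of required key quantities (phone number, reference number, etc.)
    that actually appear in the generated response."""
    if not key_values:
        return 1.0
    hits = sum(1 for value in key_values if value.lower() in response.lower())
    return hits / len(key_values)

def reinforce_loss(token_log_probs: torch.Tensor, reward: float, baseline: float = 0.0) -> torch.Tensor:
    """REINFORCE-style loss: scale the sampled response's log-likelihood by the
    baseline-subtracted reward so that responses containing the key quantities
    are reinforced."""
    return -(reward - baseline) * token_log_probs.sum()

# toy usage: log-probs of a sampled response and its slot-level reward
log_probs = torch.log(torch.rand(12, requires_grad=True))
reward = key_quantity_reward(
    "The Acorn Guest House can be reached at 01223353888.",
    ["01223353888", "acorn guest house"])
loss = reinforce_loss(log_probs, reward)
loss.backward()
```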
Human pose estimation aims to localize the keypoints of all persons in various scenes. Despite promising results, current methods still face some challenges. Existing top-down methods deal with each person individually, without modeling the interaction between different people and the scene they are in; consequently, the performance of human detection degrades when severe occlusion occurs. On the other hand, existing bottom-up methods consider all persons simultaneously and capture global knowledge of the entire image; however, they are less accurate than top-down methods because of scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT), which integrates the top-down and bottom-up pipelines to explore the visual clues of different receptive fields and achieve their complementarity. Specifically, DPIT consists of two branches: the bottom-up branch processes the entire image to capture global visual information, while the top-down branch extracts feature representations of local vision from single-person bounding boxes. The feature representations extracted from the bottom-up and top-down branches are then fed into a transformer encoder to fuse global and local knowledge interactively. Moreover, we define keypoint queries to explore both full-scene and single-person pose visual clues, realizing the mutual complementarity of the two pipelines. To the best of our knowledge, this is one of the earliest works to integrate the bottom-up and top-down pipelines with transformers for human pose estimation. Extensive experiments on the COCO and MPII datasets demonstrate that our DPIT is comparable to state-of-the-art methods.
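A minimal sketch of how the two pipelines could be fused with keypoint queries is given below; the use of a plain transformer encoder/decoder pair, the token shapes, and the coordinate-regression head are assumptions, not the exact DPIT architecture.

```python
import torch
import torch.nn as nn

class DualPipelineFusion(nn.Module):
    """Illustrative fusion of bottom-up (whole-image) and top-down (per-person)
    tokens with learnable keypoint queries; dimensions and the plain
    encoder/decoder stack are assumptions, not the exact DPIT design."""

    def __init__(self, dim: int = 256, num_keypoints: int = 17):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.keypoint_queries = nn.Parameter(torch.randn(num_keypoints, dim))
        self.head = nn.Linear(dim, 2)  # regress (x, y) for each keypoint query

    def forward(self, global_tokens: torch.Tensor, local_tokens: torch.Tensor):
        # global_tokens: (B, Ng, dim) from the bottom-up branch (whole image)
        # local_tokens:  (B, Nl, dim) from the top-down branch (person box)
        memory = self.encoder(torch.cat([global_tokens, local_tokens], dim=1))
        queries = self.keypoint_queries.unsqueeze(0).expand(global_tokens.size(0), -1, -1)
        return self.head(self.decoder(queries, memory))  # (B, num_keypoints, 2)

# toy usage with hypothetical token maps from the two branches
coords = DualPipelineFusion()(torch.randn(2, 196, 256), torch.randn(2, 49, 256))
```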
As task-oriented dialogue systems become increasingly popular in our lives, more realistic tasks are being proposed and explored. However, new practical challenges arise. For instance, current dialogue systems cannot effectively handle multiple search results when querying a database, owing to the lack of such scenarios in existing public datasets. In this paper, we propose Database Search Result (DSR) Disambiguation, a new task focused on disambiguating database search results, which enhances user experience by allowing users to choose from multiple options instead of just one. To study this task, we augment the popular task-oriented dialogue datasets (MultiWOZ and SGD) with turns that resolve ambiguities, created by (a) synthesizing them with a predefined grammar and (b) collecting human paraphrases for a subset. We find that training on our augmented dialogue data improves the model's ability to handle ambiguous scenarios without sacrificing performance on unmodified turns. Furthermore, our data helps models perform better on DSR disambiguation even in the absence of in-domain data, suggesting that it can be learned as a universal dialogue skill. Our data and code will be publicly available.
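A toy example of grammar-based augmentation is sketched below; the templates and slot names are hypothetical and only illustrate how a disambiguation turn could be synthesized when a database query returns several matches.

```python
import random

# Hypothetical templates standing in for the predefined grammar.
SYSTEM_TEMPLATES = [
    "I found {n} places matching your request: {names}. Which one would you like?",
    "There are {n} options, including {names}. Do any of these work for you?",
]

def make_disambiguation_turn(db_results: list[dict]) -> str:
    """Synthesize a system turn that surfaces several database matches instead of
    silently picking one, so the user can disambiguate."""
    names = ", ".join(result["name"] for result in db_results[:3])
    return random.choice(SYSTEM_TEMPLATES).format(n=len(db_results), names=names)

# toy usage with made-up hotel search results
turn = make_disambiguation_turn([
    {"name": "Acorn Guest House"},
    {"name": "Alexander Bed and Breakfast"},
    {"name": "Allenbell"},
])
```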
A commonly observed problem with state-of-the-art natural language technologies, such as Amazon Alexa and Apple Siri, is that their services do not extend to most citizens of developing countries because of language barriers: such populations lack the resources available in their languages to build NLP products. This paper presents AllWOZ, a multilingual, multi-domain, task-oriented customer-service dialogue dataset covering eight languages: English, Mandarin, Korean, Vietnamese, Hindi, French, Portuguese, and Thai. Furthermore, we create a benchmark for the multilingual dataset by applying mT5 with meta-learning.
While computed tomography (CT) reconstruction from X-ray sinograms is necessary for clinical diagnosis, the iodine radiation involved in the imaging process induces irreversible damage, which has driven researchers to study sparse-view CT reconstruction, i.e., recovering a high-quality CT image from a sparse set of sinogram views. Iterative models have been proposed to alleviate the artifacts that appear in sparse-view CT images, but their computational cost is too expensive. Deep-learning-based methods have since gained prevalence thanks to their excellent performance and computational efficiency. However, these methods ignore the mismatch between the local feature extraction capability of CNNs and the global nature of sinogram features. To overcome this problem, we propose the Dual-Domain Transformer (DuDoTrans), which exploits the long-range dependency modeling capability of transformers to simultaneously restore informative sinograms and reconstruct CT images from both the enhanced and the raw sinograms. With this novel design, reconstruction experiments on the NIH-AAPM dataset and the COVID-19 dataset confirm the effectiveness and generalizability of DuDoTrans with fewer parameters. Extensive experiments also demonstrate its robustness to different noise levels in sparse-view CT reconstruction. The code and models are publicly available at https://github.com/dudotrans/code
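The sketch below outlines a dual-domain pipeline in this spirit: a transformer restores the sinogram and an image-domain network refines the reconstruction; the learned linear projection standing in for a differentiable filtered back-projection and all layer sizes are simplifying assumptions, not the DuDoTrans implementation.

```python
import torch
import torch.nn as nn

class DualDomainSketch(nn.Module):
    """Minimal dual-domain pipeline in the spirit of DuDoTrans: a transformer
    restores the sparse-view sinogram, a learned projection (standing in for a
    differentiable filtered back-projection) maps it to the image domain, and a
    small CNN refines the image. All sizes are illustrative."""

    def __init__(self, n_views: int = 64, n_bins: int = 128, img_size: int = 128):
        super().__init__()
        self.img_size = img_size
        layer = nn.TransformerEncoderLayer(d_model=n_bins, nhead=8, batch_first=True)
        self.sino_restorer = nn.TransformerEncoder(layer, num_layers=2)   # rows = views
        self.to_image = nn.Linear(n_views * n_bins, img_size * img_size)  # FBP stand-in
        self.refiner = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, sinogram: torch.Tensor):                     # (B, n_views, n_bins)
        restored = sinogram + self.sino_restorer(sinogram)         # sinogram-domain residual
        coarse = self.to_image(restored.flatten(1))
        coarse = coarse.view(-1, 1, self.img_size, self.img_size)  # coarse CT image
        return coarse + self.refiner(coarse)                       # image-domain refinement

# toy usage: a batch of 2 sparse-view sinograms with 64 views and 128 detector bins
reconstruction = DualDomainSketch()(torch.randn(2, 64, 128))
```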
Structured text understanding on visually rich documents (VRDs) is a crucial part of document intelligence. Due to the complexity of content and layout in VRDs, structured text understanding is a challenging task. Most existing studies decouple this problem into two sub-tasks: entity labeling and entity linking, which require a holistic understanding of the document context at both the token and segment levels. However, little work has focused on solutions that efficiently extract structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on a transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. In addition, we design a novel pre-training strategy with three self-supervised tasks to learn richer representations. StrucTexT uses the existing Masked Visual Language Modeling task together with new Sentence Length Prediction and Paired Box Direction tasks to incorporate multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at the segment level and the token level, and show that it outperforms state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets.
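As a small illustration of working at two granularities, the sketch below pools token features into segment features with a token-to-segment assignment; this average pooling is an assumption and a simplification of the segment-token aligned encoder described above.

```python
import torch

def segment_token_pool(token_feats: torch.Tensor, seg_ids: torch.Tensor, num_segments: int):
    """Average the features of the tokens assigned to each text segment (layout box),
    producing segment-level features alongside the token-level ones.
    A simplification of the segment-token aligned encoder, for illustration."""
    B, N, _ = token_feats.shape
    assign = torch.zeros(B, N, num_segments, device=token_feats.device)
    assign.scatter_(2, seg_ids.unsqueeze(-1), 1.0)            # token-to-segment one-hot
    counts = assign.sum(dim=1).clamp(min=1).unsqueeze(-1)     # tokens per segment
    return assign.transpose(1, 2) @ token_feats / counts      # (B, num_segments, D)

# toy usage: 50 tokens grouped into 8 segments
segment_feats = segment_token_pool(torch.randn(2, 50, 256),
                                   torch.randint(0, 8, (2, 50)), 8)
```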
Computed tomography (CT) reconstruction from X-ray projections acquired within a limited angular range is challenging, especially when the angular range is extremely small. Both analytical and iterative models need more projections to reconstruct effectively. Deep learning methods have gained prevalence due to their excellent reconstruction performance, but such success is mainly limited to the same dataset and does not generalize across datasets with different distributions. Herein, we propose ExtraPolationNet for limited-angle CT reconstruction by introducing a sinogram-extrapolation module, which is theoretically justified. The module supplies extra sinogram information and boosts model generalizability. Extensive experimental results show that our reconstruction model achieves state-of-the-art performance on the NIH-AAPM dataset, similar to existing approaches. More importantly, we show that using such a sinogram-extrapolation module significantly improves the generalization capability of the model on unseen datasets (e.g., the COVID-19 and LIDC datasets) compared with existing approaches.
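The following sketch illustrates the general idea of sinogram extrapolation: zero-pad the measured angular range to the full range and let a small network fill in the missing projections; the residual CNN and the padding scheme are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SinogramExtrapolation(nn.Module):
    """Illustrative sinogram-extrapolation module for limited-angle CT: zero-pad
    the measured angular range to the full range, predict a residual for every
    row with a small CNN, and keep the measured rows untouched. Layer sizes and
    the residual scheme are assumptions, not the paper's module."""

    def __init__(self, full_views: int = 180):
        super().__init__()
        self.full_views = full_views
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, measured: torch.Tensor):        # (B, 1, limited_views, n_bins)
        missing = self.full_views - measured.size(2)  # number of missing view rows
        padded = F.pad(measured, (0, 0, 0, missing))  # zero-fill the missing angles
        completed = padded + self.net(padded)         # residual prediction for all rows
        # use predictions only for the angular gap; keep measured projections as-is
        return torch.cat([measured, completed[:, :, measured.size(2):]], dim=2)

# toy usage: projections over a limited range of 120 views, extrapolated to 180
full_sinogram = SinogramExtrapolation()(torch.randn(2, 1, 120, 256))
```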
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms for video-language pre-training, and thus do not fully exploit the unique characteristic of video, i.e., its temporal structure. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling the cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with 8.6% and 11.1% improvements respectively. HiTeA also demonstrates strong generalization ability when transferred directly to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
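A minimal sketch of a shuffling test is shown below: the same metric is computed on temporally ordered and shuffled frames, and the gap indicates how much the model relies on temporal order; the `model` and `metric` interfaces are hypothetical.

```python
import torch

def shuffling_test(model, videos: torch.Tensor, texts, metric) -> float:
    """Compare a metric on temporally ordered versus shuffled frames; a large gap
    means the model/dataset genuinely relies on temporal order. The
    `model(videos, texts)` and `metric(outputs)` interfaces are hypothetical."""
    ordered = metric(model(videos, texts))
    perm = torch.randperm(videos.size(1))      # shuffle the frame (time) axis
    shuffled = metric(model(videos[:, perm], texts))
    return ordered - shuffled                  # temporal reliance gap

# toy usage with stand-in model and metric on (B, T, C, H, W) clips
toy_model = lambda v, t: v.mean(dim=(1, 2, 3, 4))
toy_metric = lambda scores: scores.mean().item()
gap = shuffling_test(toy_model, torch.randn(4, 16, 3, 32, 32), ["a caption"] * 4, toy_metric)
```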
Face manipulation detection has been receiving considerable attention owing to its importance for the reliability and security of face images. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which has shown to be promising. As an important face feature, the face depth map, which has been shown to be effective in other areas such as face recognition and face detection, has unfortunately received little attention in the literature on detecting manipulated face images. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information to tackle the problem of face manipulation detection in real-world applications. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomalies created by manipulation. The estimated face depth map is then treated as auxiliary information to be integrated with the backbone features using a newly designed Multi-head Depth Attention (MDA) mechanism. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.
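The sketch below illustrates one plausible form of depth attention, where RGB backbone tokens attend to the estimated depth tokens through standard multi-head attention; the query/key/value roles, dimensions, and residual fusion are assumptions about the MDA design rather than its exact definition.

```python
import torch
import torch.nn as nn

class MultiHeadDepthAttention(nn.Module):
    """Illustrative depth attention: RGB backbone tokens attend to patch-wise
    face-depth tokens through standard multi-head attention, so local depth
    anomalies can modulate the manipulation-detection features. The
    query/key/value roles and the residual fusion are assumptions about MDA."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor):
        # rgb_tokens, depth_tokens: (B, N_patches, dim)
        attended, _ = self.attn(query=rgb_tokens, key=depth_tokens, value=depth_tokens)
        return self.norm(rgb_tokens + attended)    # depth-aware backbone features

# toy usage with 14x14 patch tokens from the backbone and the estimated depth map
fused = MultiHeadDepthAttention()(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```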